Excel Extractor

ETK's Excel Extractor is a cell-based extractor for extracting data from compatible spreadsheets.

Souce spreadsheet

The example spreadsheet file named alabama.xml and it has a sheet named 16tbl08al, in which row 1 to row 5 and row 60 to row 62 are metadata, 6A to M59 is a table (which has row and column headers). For this example, I'm going to extract data from C7 to M33 (see the picture attached below).

Define where and how to extract data

Excel Extractor will scan cell-by-cell within a region that you specified and populate variables that you defined.

Define variable

In this particular example, I want to extract value of all cells in region (C7, M33) and I defined a variable called value. Its value will be extracted from a cell located at $col,$row where $col and $row mean current column id and row id that the scanner is traversing at. The return is a list of object which contains user-defined variables.


In [1]:
import pprint
from etk.extractors.excel_extractor import ExcelExtractor
ee = ExcelExtractor()
variables = {
    'value': '$col,$row'
}
raw_extractions = ee.extract('alabama.xls', '16tbl08al', ['C,7', 'M,33'], variables)
pprint.pprint(raw_extractions[:10])  # print first 10


[{'value': 73},
 {'value': 1},
 {'value': 12},
 {'value': ''},
 {'value': 8},
 {'value': 52},
 {'value': 429},
 {'value': 146},
 {'value': 233},
 {'value': 50}]

Coordinate variable

Excel Extractor allows you to define multiple variables. This is useful if you want to extract the data from other cells which are associated with current cell. In this example, I also need column header (category) and county name of every cell in the region. It supports constant coordinate like ($B,$1) (which means the cell at column B row 1) or using + and - to caculate relative coordinate like ($B-1,$row+1) (which means the cell at column A and its row id is current row id + 1).


In [2]:
variables = {
    'value': '$col,$row',
    'county': '$B,$row',
    'category': '$col,$6'
}
raw_extractions = ee.extract('alabama.xls', '16tbl08al', ['C,7', 'M,33'], variables)
pprint.pprint(raw_extractions[:10]) # print first 10


[{'category': 'Violent\ncrime', 'county': 'Autauga', 'value': 73},
 {'category': 'Murder and\nnonnegligent\nmanslaughter',
  'county': 'Autauga',
  'value': 1},
 {'category': 'Rape\n(revised\ndefinition)1', 'county': 'Autauga', 'value': 12},
 {'category': 'Rape\n(legacy\ndefinition)2', 'county': 'Autauga', 'value': ''},
 {'category': 'Robbery', 'county': 'Autauga', 'value': 8},
 {'category': 'Aggravated\nassault', 'county': 'Autauga', 'value': 52},
 {'category': 'Property\ncrime', 'county': 'Autauga', 'value': 429},
 {'category': 'Burglary', 'county': 'Autauga', 'value': 146},
 {'category': 'Larceny-\ntheft', 'county': 'Autauga', 'value': 233},
 {'category': 'Motor\nvehicle\ntheft', 'county': 'Autauga', 'value': 50}]

Single variable

Besides the coordinate, the value of variables can also be a builtin variable (it only has $row and $col right now). This can be used for tracking provenance of extractions. Both row and column id here are presented in numeric form (base is 0).


In [3]:
variables = {
    'value': '$col,$row',
    'county': '$B,$row',
    'category': '$col,$6',
    'from_row': '$row',
    'from_col': '$col'
}
raw_extractions = ee.extract('alabama.xls', '16tbl08al', ['C,7', 'M,33'], variables)
pprint.pprint(raw_extractions[:10])  # print first 10


[{'category': 'Violent\ncrime',
  'county': 'Autauga',
  'from_col': 2,
  'from_row': 6,
  'value': 73},
 {'category': 'Murder and\nnonnegligent\nmanslaughter',
  'county': 'Autauga',
  'from_col': 3,
  'from_row': 6,
  'value': 1},
 {'category': 'Rape\n(revised\ndefinition)1',
  'county': 'Autauga',
  'from_col': 4,
  'from_row': 6,
  'value': 12},
 {'category': 'Rape\n(legacy\ndefinition)2',
  'county': 'Autauga',
  'from_col': 5,
  'from_row': 6,
  'value': ''},
 {'category': 'Robbery',
  'county': 'Autauga',
  'from_col': 6,
  'from_row': 6,
  'value': 8},
 {'category': 'Aggravated\nassault',
  'county': 'Autauga',
  'from_col': 7,
  'from_row': 6,
  'value': 52},
 {'category': 'Property\ncrime',
  'county': 'Autauga',
  'from_col': 8,
  'from_row': 6,
  'value': 429},
 {'category': 'Burglary',
  'county': 'Autauga',
  'from_col': 9,
  'from_row': 6,
  'value': 146},
 {'category': 'Larceny-\ntheft',
  'county': 'Autauga',
  'from_col': 10,
  'from_row': 6,
  'value': 233},
 {'category': 'Motor\nvehicle\ntheft',
  'county': 'Autauga',
  'from_col': 11,
  'from_row': 6,
  'value': 50}]

Wrap them up in ETK module and post processing

The below example shows how to use this extractor in ETK module. The extractor's variable syntax only supports using a single builtin variable or a coordinate. All the post processings need to be done after extraction.


In [4]:
import os, sys
from etk.etk import ETK
from etk.etk_module import ETKModule
from etk.extractors.excel_extractor import ExcelExtractor
from etk.utilities import Utility


class ExampleETKModule(ETKModule):
    """
    Abstract class for extraction module
    """
    def __init__(self, etk):
        ETKModule.__init__(self, etk)
        self.ee = ExcelExtractor()

    def document_selector(self, doc):
        return 'file_path' in doc.cdr_document

    def process_document(self, doc):
        """
        Add your code for processing the document
        """

        variables = {
            'value': '$col,$row',
            'county': '$B,$row',
            'category': '$col,$6',
            'from_row': '$row',
            'from_col': '$col'
        }

        raw_extractions = self.ee.extract(doc.cdr_document['file_path'], '16tbl08al', ['C,7', 'M,33'], variables)

        extracted_docs = []
        for d in raw_extractions:
            # post processing
            d['category'] = d['category'].replace('\n', ' ').strip()
            d['county'] = d['county'].replace('\n', ' ').strip()
            d['from_row'] = int(d['from_row'])
            d['from_col'] = int(d['from_col'])
            
            # create sub document
            d['doc_id'] = Utility.create_doc_id_from_json(d)
            extracted_docs.append(etk.create_document(d))

        return extracted_docs


# if __name__ == "__main__":
etk = ETK(modules=ExampleETKModule)
doc = etk.create_document({'file_path': 'alabama.xls'})
docs = etk.process_ems(doc)

for d in docs[1:11]:  # print first 10
    print(d.value)


{'value': 73, 'county': 'Autauga', 'category': 'Violent crime', 'from_row': 6, 'from_col': 2, 'doc_id': 'a37cb63313d1a36f869685c854402756bae56f300cedfbc84a286a9994aea32b'}
{'value': 1, 'county': 'Autauga', 'category': 'Murder and nonnegligent manslaughter', 'from_row': 6, 'from_col': 3, 'doc_id': 'ef3cdb52b90cf21b533d272f95bcf1a39374f659d29a2f1469770d5c04269aa6'}
{'value': 12, 'county': 'Autauga', 'category': 'Rape (revised definition)1', 'from_row': 6, 'from_col': 4, 'doc_id': '45ffcad797f25e0e4a8930a8814a5cb5c5b354d02f9ae672349ff3a9990b1f47'}
{'value': '', 'county': 'Autauga', 'category': 'Rape (legacy definition)2', 'from_row': 6, 'from_col': 5, 'doc_id': '1e5337d3a9d088eeaff36930d97134bd480a2e15430e4114f3359ff83d0d9992'}
{'value': 8, 'county': 'Autauga', 'category': 'Robbery', 'from_row': 6, 'from_col': 6, 'doc_id': '39fd8cf71ab4b7d4d55501d249603dbcae5049985731d2fee1e950d794693e87'}
{'value': 52, 'county': 'Autauga', 'category': 'Aggravated assault', 'from_row': 6, 'from_col': 7, 'doc_id': 'de8bc85d8fbc35e72bfffbd6c0e4f642c76496c7da76d3b2158044ff81edf280'}
{'value': 429, 'county': 'Autauga', 'category': 'Property crime', 'from_row': 6, 'from_col': 8, 'doc_id': 'eddcbeba7540667381f4a49f7a086f23ed08df185f4e2a93bbb9fd3932ae4619'}
{'value': 146, 'county': 'Autauga', 'category': 'Burglary', 'from_row': 6, 'from_col': 9, 'doc_id': 'c3887d1cd07c483160291f1cb345f5f4103109452948ec07dc840c576107e578'}
{'value': 233, 'county': 'Autauga', 'category': 'Larceny- theft', 'from_row': 6, 'from_col': 10, 'doc_id': 'a2cd5eb770ba9394ba2bf2d04637acc27a3ec76cfe5addd48365cd9950c99585'}
{'value': 50, 'county': 'Autauga', 'category': 'Motor vehicle theft', 'from_row': 6, 'from_col': 11, 'doc_id': '5d5721be75fddab89411a44a2a0390a73898184daa952119352453c68c8c3c6e'}